Below are 100 randomly selected rows from the dataset.
The table below summarizes each column/variable with the following metrics: the number of unique values, the number of NAs, the maximum value, the minimum value, and the mean.
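As a rough illustration of how such per-column metrics can be computed, here is a minimal sketch in plain Python. It is not the code used to build the table: the helper name `summarize_column`, the use of `None` to stand in for NA, and the toy rows are all assumptions made for the example. It also assumes that unique values are counted over non-missing entries only.

```python
from statistics import mean


def summarize_column(values):
    """Return unique count, NA count, min, max, and mean for one column.

    None stands in for NA (an assumption of this sketch). min, max, and
    mean are computed on non-missing values only; the mean is None when
    the column is empty or not numeric.
    """
    present = [v for v in values if v is not None]
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return {
        "n_unique": len(set(present)),
        "n_na": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "mean": mean(present) if present and numeric else None,
    }


# Hypothetical toy rows; the values are made up for illustration.
rows = [
    {"prospectid": 1, "regtenure": 10},
    {"prospectid": 2, "regtenure": None},
    {"prospectid": 1, "regtenure": 30},
]

# Pivot the rows into columns, then summarize each column.
columns = {name: [r[name] for r in rows] for name in rows[0]}
summary = {name: summarize_column(vals) for name, vals in columns.items()}
```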
33,407 prospectids appear multiple times, accounting for 85,349 rows in the dataset.
33.8% of the values for that variable. May want to look at a cumulative view of some kind.
A closer look at dna_visittrafficsubtype shows that many of the subtypes are rarely found in this dataset. Grouping or combining these in a meaningful manner may help, but unfortunately I doubt I have sufficient information or experience to group the levels of this variable.
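One fallback that needs no domain knowledge is to lump infrequent levels of dna_visittrafficsubtype into a single catch-all level based purely on frequency. The sketch below shows the idea; it is not a meaningful grouping in the sense described above. The function name `lump_rare_levels`, the `min_share` threshold, the "Other" label, and the toy values are all hypothetical.

```python
from collections import Counter


def lump_rare_levels(values, min_share=0.05, other_label="Other"):
    """Replace levels whose share of rows is below min_share with other_label."""
    counts = Counter(values)
    total = len(values)
    # Keep only the levels that are frequent enough.
    keep = {level for level, n in counts.items() if n / total >= min_share}
    return [v if v in keep else other_label for v in values]


# Made-up subtype values: "a" is 60% of rows, "b" 30%, "c" 10%.
subtypes = ["a"] * 6 + ["b"] * 3 + ["c"]
lumped = lump_rare_levels(subtypes, min_share=0.2)
lumped_counts = Counter(lumped)
```

The threshold is a judgment call: too high and informative subtypes disappear into the catch-all level, too low and the rare levels remain.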
After removing the outlier dates (noted above) for ordercreatedate, we can better see the general trend.
After removing the NAs from dnatestactivationdayid, we can better see the general trend.
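The two cleaning steps above (dropping outlier dates and dropping NAs) can be sketched as a single filter applied before counting rows per day. This is only an illustration: the helper `daily_counts`, the use of `None` for NA, the date window, and the sample dates are assumptions, and the actual outlier cutoffs are the ones noted earlier in the report.

```python
from collections import Counter
from datetime import date


def daily_counts(dates, start, end):
    """Count rows per day, skipping missing dates and dates outside [start, end]."""
    kept = [d for d in dates if d is not None and start <= d <= end]
    # Sort by day so the result reads as a time series.
    return dict(sorted(Counter(kept).items()))


# Made-up dates: one missing value and one obvious outlier (year 1900).
order_dates = [
    date(2016, 3, 2),
    None,
    date(1900, 1, 1),
    date(2016, 3, 1),
    date(2016, 3, 2),
]

# Hypothetical window; substitute the real cutoffs for the dataset.
counts = daily_counts(order_dates, start=date(2015, 1, 1), end=date(2017, 12, 31))
```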
Variance appears to tighten up in 2016-2017, and the obvious drop from late 2016 into 2017 will cause problems for most models. Forecasting or predicting could prove difficult if the model isn’t able to account for the sudden drop.
A more detailed view of this daily xsell conversion may help us understand what is influencing this behavior and how that might affect model construction.
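For reference, a daily conversion rate of this kind can be computed as the share of rows per day that converted. The sketch below assumes each row reduces to a (day, converted) pair with a 0/1 flag; that layout, the function name `daily_conversion`, and the sample rows are hypothetical and may not match how xsell is actually recorded in the dataset.

```python
from collections import defaultdict


def daily_conversion(rows):
    """Return {day: conversions / rows} for an iterable of (day, converted) pairs."""
    totals = defaultdict(lambda: [0, 0])  # day -> [conversions, row count]
    for day, converted in rows:
        totals[day][0] += int(converted)
        totals[day][1] += 1
    return {day: conv / n for day, (conv, n) in sorted(totals.items())}


# Made-up rows: (day, xsell flag).
sample = [
    ("2016-11-01", 1),
    ("2016-11-01", 0),
    ("2016-11-02", 0),
    ("2016-11-02", 0),
    ("2016-11-02", 1),
    ("2016-11-02", 1),
]
rates = daily_conversion(sample)
```

Days with very few rows will produce noisy rates, so it may help to look at the row count per day alongside the rate.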
I must not understand the regtenure column yet.